Historically, researchers in the field have spent a great deal of effort to create image representations that have scale invariance and retain spatial location information. This paper proposes to encode equivalent temporal characteristics in video representations for action recognition. To achieve temporal scale invariance, we develop a method called temporal scale pyramid (TSP). To encode temporal information, we present and compare two methods called temporal extension descriptor (TED) and temporal division pyramid (TDP). Our purpose is to suggest solutions for matching complex actions that have large variation in velocity and appearance, which is missing from most current action representations. The experimental results on four benchmark datasets, UCF50, HMDB51, Hollywood2 and Olympic Sports, support our approach, which significantly outperforms state-of-the-art methods. Most notably, we achieve 65.0% mean accuracy and 68.2% mean average precision on the challenging HMDB51 and Hollywood2 datasets, which constitutes an absolute improvement over the state-of-the-art by 7.8% and 3.9%, respectively.